- This document contains some technical information about SEEK. Programmers
- who want to write their own programs to operate on the text will find
- pertinent information here.
-
- Memory Problems
- ===============
-
- So that this program is usable on 1 Mb machines, I have set the WimpSlot
- parameter quite tight, so some configurations might occasionally run out of
- memory. If you do encounter memory-related errors such as "Too many nested
- procedures" or "no room for procedure call", edit the !Run file and
- increase the WimpSlot parameter.
-
- The Compression Algorithm
- =========================
-
- Firstly there is a file (WORDSORT) which contains all the words used in the
- entire set of text files, sorted into alphabetical order. It also contains
- the number of occurrences of each word.
-
- You can read WORDSORT like this:-
-
-   file%=OPENIN("WORDSORT")
-   REPEAT
-     INPUT#file%,word$,count% : REM next word and its number of occurrences
-     . . .
-   UNTIL EOF#file%
-   CLOSE#file%
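
For programmers working outside BASIC, here is a minimal Python sketch of the same loop. It rests on several assumptions that are not stated in this document: that WORDSORT was written with PRINT#, that the items use the BBC BASIC data-file layout (an integer is the byte &40 followed by four bytes, most significant first; a string is the byte &00, a length byte, then the characters in reverse order), and that a word's number is simply its position in the file, counting from 1. The function names and the sample bytes and counts are made up for illustration.

```python
import io

def read_item(stream):
    """Read one PRINT#-style item; return None at end of file."""
    tag = stream.read(1)
    if not tag:
        return None
    if tag == b"\x40":
        # Assumed integer layout: four bytes, most significant first.
        return int.from_bytes(stream.read(4), "big", signed=True)
    if tag == b"\x00":
        # Assumed string layout: length byte, then characters reversed.
        length = stream.read(1)[0]
        return stream.read(length)[::-1].decode("latin-1")
    raise ValueError(f"unsupported item tag {tag!r}")

def read_wordsort(stream):
    """Return a list of (word, count) pairs in file order."""
    entries = []
    while True:
        word = read_item(stream)
        if word is None:
            break
        count = read_item(stream)
        entries.append((word, count))
    return entries

# Hypothetical two-entry file built in memory; the counts are invented.
sample = (b"\x00\x01a" + b"\x40\x00\x00\x00\x05"
          + b"\x00\x05noraa" + b"\x40\x00\x00\x00\x02")
entries = read_wordsort(io.BytesIO(sample))

# Assumption: word numbers are positions in the sorted list, starting at 1.
word_number = {word: n for n, (word, _) in enumerate(entries, start=1)}
```

If the real file turns out to use a different item layout, only read_item would need to change.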
-
- The compressed file contains a number of 16-bit tokens, which can be one of
- five types:-
-
- Punctuation mark (after word):
-   bits 0-1:  Word Type = 3 (special)
-   bits 2-7:  Marker = 0
-   bits 8-15: Punctuation Character in ASCII
-
- Punctuation mark (before word):
-   bits 0-1:  Word Type = 3 (special)
-   bits 2-7:  Marker = 1
-   bits 8-15: Punctuation Character in ASCII
-
- Verse mark:
-   bits 0-1:  Word Type = 3 (special)
-   bits 2-7:  Marker = 2
-   bits 8-15: Verse Number
-
- Chapter mark:
-   bits 0-1:  Word Type = 3 (special)
-   bits 2-7:  Marker = 3
-   bits 8-15: Chapter Number
-
- Word Token:
-   bits 0-1:  Word Type
-                0 = lower case
-                1 = Initial Capital
-                2 = ALL UPPER CASE
-   bits 2-15: Word Number
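
The bit layouts above can be unpacked with a few shifts and masks. The following Python sketch is only an illustration: the function name and the labels it returns are invented here, and it works on an integer that has already been read from the file (the byte order of the tokens on disk is not stated in this document).

```python
WORD_CASES = ("lower", "capitalised", "upper")
MARKERS = {0: "punct_after", 1: "punct_before", 2: "verse", 3: "chapter"}

def decode_token(token):
    """Split one 16-bit token into its fields."""
    word_type = token & 0b11              # bits 0-1
    if word_type != 3:
        # Word token: bits 2-15 hold the word number.
        return ("word", WORD_CASES[word_type], token >> 2)
    marker = (token >> 2) & 0x3F          # bits 2-7
    value = (token >> 8) & 0xFF           # bits 8-15
    kind = MARKERS[marker]
    if marker in (0, 1):
        return (kind, chr(value))         # punctuation character
    return (kind, value)                  # verse or chapter number

# Word number 1409 with an initial capital, then a comma after a word.
capitalised_word = decode_token((1409 << 2) | 1)
comma = decode_token((ord(",") << 8) | (0 << 2) | 3)
```

A marker value outside 0-3 raises KeyError in this sketch, since the list above defines no other special tokens.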
-
-
- Whenever a new chapter starts, there will be a chapter token.
-
- Whenever a new verse starts, there will be a verse token.
-
- Any characters other than alphabetic characters and apostrophes are
- represented by punctuation tokens. There are two classes of punctuation
- token: punctuation that follows a word, e.g.
-   THIS, THAT! THESE.
- and punctuation that precedes a word, e.g.
-   "THIS (THAT
-
- The actual words of the text are represented by a 14-bit word number, so
- the maximum word number is 16,383. E.g. word #1 is "a", word #2 is "aaron",
- word #3 is "aaron's". If the word appears with a leading capital letter,
- then the word type is set to 1. If all the letters in the word are upper
- case, then the word type is set to 2. Word type 3 is for non-word tokens.
-
- The words are always separated by spaces. There is no token for a space,
- since the program knows that there is always a space after a word. The space
- goes after all the marker-0 (after-word) punctuation and before any marker-1
- (before-word) punctuation. This means that when decoding the text, you need
- to look at the next token before you can decide whether a space is required
- at the current position.
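
One way to apply the spacing rule is sketched below in Python. Instead of peeking at the next token, it remembers that a space is owed after each word and only emits it when the next word or before-word punctuation arrives, which gives the same result. The token tuples and the function name are invented for the illustration, and the words are assumed to have been looked up and given their capitals already.

```python
def render(tokens):
    """Join decoded tokens into text, placing the implied spaces."""
    out = []
    space_pending = False                 # a space is owed after the last word
    for kind, text in tokens:
        if kind == "after":
            # After-word punctuation sits directly against the word.
            out.append(text)
        elif kind in ("before", "word"):
            if space_pending:
                out.append(" ")
            out.append(text)
            # Only a word creates a new owed space.
            space_pending = (kind == "word")
    return "".join(out)

fragment = [
    ("word", "Rabbi"), ("after", ","), ("before", "("),
    ("word", "which"), ("word", "is"), ("word", "to"),
    ("word", "say"), ("after", ","),
]
text = render(fragment)
```

Verse and chapter marks are simply ignored by this sketch; a real decoder would start a new output line at those points.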
-
-
- Let's look at an example:
-
- They said unto him, Rabbi, (which is to say, being interpreted, Master,)
- where dwellest thou?
-
- This becomes
- WORD CAPITALISED They
- WORD said
- WORD unto
- WORD him
- PUNCTUATION AFTER ,
- WORD CAPITALISED Rabbi
- PUNCTUATION AFTER ,
- PUNCTUATION BEFORE (
- WORD which
- WORD is
- WORD to
- WORD say
- PUNCTUATION AFTER ,
- WORD being
- WORD interpreted
- PUNCTUATION AFTER ,
- WORD CAPITALISED Master
- PUNCTUATION AFTER ,
- PUNCTUATION AFTER )
- WORD where
- WORD dwellest
- WORD thou
- PUNCTUATION AFTER ?
- VERSE MARK
-
- So the text is compressed from the 94 bytes of plain text (plus a bit for
- the chapter & verse numbers) to 24 16-bit tokens, i.e. 48 bytes.
-
-
-
- The main objective of this compression is to improve word search speed. This
- is achieved as follows:-
-
- Suppose we want to find all verses containing both "Rabbi" and
- "interpreted". We proceed as follows:-
-
- Look up "rabbi" and "interpreted" in the word list. The word "rabbi"
- occurs 8 times, and "interpreted" occurs 11 times. Choose the least
- frequent word to be the primary search parameter, in this case "rabbi".
- The word number of "rabbi" is 1409.
-
- Load each file in turn into memory, and scan it.
-
- Look at each 16-bit token. If the word type is not 3, and the word number
- is 1409 then we have a match. It could be "rabbi", "Rabbi" or "RABBI"
- depending on the word type.
-
- Keep track of the last chapter and verse tokens during this search.
-
- When we find a verse containing "rabbi", jump back to the start of the
- verse, and scan the verse for word number 921 ("interpreted").
-
- Only decompress the text once all the search keys have been satisfied.
-
- Suppose we want to find all verses containing both "archimedes" and
- "computer".
-
- Look up "archimedes" and "computer" in the word list. "archimedes" is not
- in the word list at all. Therefore, the word "archimedes" can't be in the
- text, so there's no point looking for it.
-
- Reply immediately: SEARCH COMPLETE - NO OCCURRENCES FOUND.
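
The two searches above can be sketched in Python as follows. This is an illustration only, not the ARM code used by SEEK: the function names are invented, the sample token stream is made up (word number 10 is a placeholder), and it assumes each verse mark comes before the tokens of its verse. The word numbers and counts for "rabbi" and "interpreted" are the ones quoted above.

```python
SPECIAL, MARK_VERSE, MARK_CHAPTER = 3, 2, 3

def word_token(number, word_type=0):
    return (number << 2) | word_type

def mark_token(marker, value):
    return (value << 8) | (marker << 2) | SPECIAL

def is_boundary(token):
    """True for a verse or chapter mark."""
    return (token & 3 == SPECIAL
            and (token >> 2) & 0x3F in (MARK_VERSE, MARK_CHAPTER))

def find_verses(tokens, numbers, counts, wanted):
    """Return (chapter, verse) pairs whose verse contains every wanted word."""
    if any(word not in numbers for word in wanted):
        return []                          # a word is absent: no scan needed
    ordered = sorted(wanted, key=lambda word: counts[word])
    primary = numbers[ordered[0]]          # least frequent word
    others = {numbers[word] for word in ordered[1:]}
    hits, chapter, verse, start, i = [], 0, 0, 0, 0
    while i < len(tokens):
        token = tokens[i]
        if token & 3 == SPECIAL:
            marker = (token >> 2) & 0x3F
            if marker == MARK_CHAPTER:
                chapter, start = token >> 8, i + 1
            elif marker == MARK_VERSE:
                verse, start = token >> 8, i + 1
        elif token >> 2 == primary:
            end = i                        # find where this verse stops
            while end < len(tokens) and not is_boundary(tokens[end]):
                end += 1
            # Jump back to the verse start and check the other words.
            present = {t >> 2 for t in tokens[start:end] if t & 3 != SPECIAL}
            if others <= present:
                hits.append((chapter, verse))
            i = end                        # skip the rest of this verse
            continue
        i += 1
    return hits

numbers = {"interpreted": 921, "rabbi": 1409}
counts = {"interpreted": 11, "rabbi": 8}
tokens = [mark_token(MARK_CHAPTER, 1), mark_token(MARK_VERSE, 38),
          word_token(10), word_token(1409, 1), word_token(921),
          mark_token(MARK_VERSE, 49), word_token(1409, 1), word_token(10)]
both = find_verses(tokens, numbers, counts, ["rabbi", "interpreted"])
none = find_verses(tokens, numbers, counts, ["archimedes", "computer"])
```

Shifting the token right by two bits discards the word type, so "rabbi", "Rabbi" and "RABBI" all match the same word number.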
-
-
- Timing Considerations
- =====================
-
- A 3-word search takes almost exactly the same time as a 1-word search that
- produces the same number of hits.
-
- The more hits that are found, the longer it takes, because we decompress the
- verse when we have a hit, and update the progress window.
-
- The primary search algorithm is written in ARM code, but everything else is
- written in BASIC.
-
- In Mode-27, on an A5000, using an IDE hard disk, a scan of the whole Bible
- takes 21 seconds.
-
- Each verse found adds 0.07 seconds.
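
As a rough guide, the two figures above can be combined into a simple estimate. The Python sketch below assumes the cost grows linearly with the number of verses found, and the figures only apply to the A5000 setup described; the function name is invented.

```python
def estimated_seconds(verses_found, base=21.0, per_verse=0.07):
    """Estimate a whole-Bible search time from the figures quoted above."""
    return base + per_verse * verses_found

# For example, a search that finds 100 verses.
example = round(estimated_seconds(100), 2)
```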
-
-
- Memory Considerations
- =====================
-
- I want it to be usable on a 1 Mb machine. In order to achieve this, I'm
- limiting the maximum number of output lines to 1000.
-
- Each verse can occupy several lines. 1000 output lines might be about 400
- verses.
-
-
- Configuration File
- ==================
-
- With the data files, there is a file called "Config". This contains
- information about the special effects and the files to be searched.
-
- Special Effects
- ===============
-
- There must always be three special effects. Each effect is controlled by
- three lines of information. These lines contain:-
-
- The effect name - this will appear in the FORMAT window
- The string which switches the effect ON
- The string which switches the effect OFF
-
- For example:-
-
- Impress Super
- {script super}
- {script}
-
- This defines an effect called "Impress Super". When selected, verses will be
- saved like
-
- {script super}Jn 3:16{script} God so loved. . .
-
- If you load this into Impression, "Jn 3:16" will appear in superscript.
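
The way an effect wraps a reference can be sketched in Python as below. The function names are invented, and the single space between the OFF string and the verse text is inferred from the example above rather than stated anywhere.

```python
def parse_effect(lines):
    """Turn the three lines of one effect into (name, on, off)."""
    name, on, off = lines
    return (name, on, off)

def apply_effect(effect, reference, verse_text):
    """Wrap the reference in the effect's ON and OFF strings."""
    _name, on, off = effect
    return f"{on}{reference}{off} {verse_text}"

effect = parse_effect(["Impress Super", "{script super}", "{script}"])
saved = apply_effect(effect, "Jn 3:16", "God so loved. . .")
```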
-
- Files to be searched
- ====================
-
- The book files are defined as follows:-
-
- The order of the entries controls the order in which the files will be
- searched, so normally the entries will be in the order that the books occur
- in the Bible.
-
- Each entry contains the following fields, separated by commas:
-   The abbreviation for the book (used for references)
-   The filename
-   The type of the book
-   The full name of the book
-
- Book types are used to control the range of the search. The types correspond
- to the icons in the selection window.
-   Type 1: Pentateuch      Genesis-Deuteronomy
-   Type 2: History         Joshua-Esther
-   Type 3: Poetry          Job-Song of Solomon
-   Type 4: Major Prophets  Isaiah-Daniel
-   Type 5: Minor Prophets  Hosea-Malachi
-   Type 6: Gospels         Matthew-John
-   Type 7: Acts            Acts
-   Type 8: Letters         Romans-Jude
-   Type 9: Revelation      Revelation
-
- The program is very sensitive to errors in the config file, so save a copy
- first, and be careful with those commas.
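
A book entry can be checked before it goes into the config file with something like the Python sketch below. The names Book, parse_book and books_of_types are invented for the illustration, and the sample entries (including the filenames) are hypothetical. The sketch assumes each entry has exactly four comma-separated fields with no spaces around the commas.

```python
from typing import NamedTuple

class Book(NamedTuple):
    abbreviation: str
    filename: str
    book_type: int
    full_name: str

def parse_book(line):
    """Split one entry into its four fields, rejecting malformed lines."""
    fields = line.split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}: {line!r}")
    abbreviation, filename, book_type, full_name = fields
    book_type = int(book_type)
    if not 1 <= book_type <= 9:
        raise ValueError(f"book type out of range: {book_type}")
    return Book(abbreviation, filename, book_type, full_name)

def books_of_types(books, wanted_types):
    """Keep the books whose type is selected, preserving config order."""
    return [book for book in books if book.book_type in wanted_types]

# Hypothetical entries; the real filenames may differ.
books = [parse_book("Jn,John,6,John"), parse_book("Ac,Acts,7,Acts")]
gospels = books_of_types(books, {6})
```

A line with a missing or extra comma raises ValueError here, which is the kind of mistake the warning above is about.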
-
-
-